Distributed storage infrastructures require the use of data redundancy to achieve high data reliability. Unfortunately, redundancy introduces storage and communication overheads, which can either reduce the overall storage capacity of the system or increase its costs. To mitigate these overheads, different redundancy schemes have been proposed. However, due to the great variety of underlying storage infrastructures and the different application needs, optimizing these redundancy schemes for each storage infrastructure is cumbersome. The lack of rules to determine the optimal level of redundancy for each storage configuration leads developers in industry to often choose simpler redundancy schemes, which are usually not optimal. In this paper we analyze the cost of different redundancy schemes and derive a set of rules to determine which redundancy scheme minimizes the storage and communication costs for a given system configuration. Additionally, we use simulation to show that theoretically optimal schemes may not be viable in a realistic setting where nodes can go offline and repairs may be delayed. In these cases, we identify the trade-offs between the storage and communication overheads of a redundancy scheme and its data reliability.